Importing relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')
!pip install lime
import lime
from lime import lime_tabular
#importing monthly table
df = pd.read_csv("/content/monthly_table.csv")
df
| | CLNT_NO | ME_DT | mth_txn_amt_sum | mth_txn_cnt | amt_sum_3M | amt_mean_3M | amt_max_3M | txn_cnt_sum_3M | txn_cnt_mean_3M | txn_cnt_max_3M | ... | cnt_Monday | cnt_Saturday | cnt_Sunday | cnt_Thursday | cnt_Tuesday | cnt_Wednesday | last_monthly_purchase | days_since_last_txn | customer_id | response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CS1112 | 2011-05-31 | 72 | 1 | 0 | 0.000000 | 0 | 0 | 0.0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | CS1112 | 0 |
| 1 | CS1112 | 2011-06-30 | 56 | 1 | 0 | 0.000000 | 0 | 0 | 0.0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 2011-06-15 00:00:00 | 15 | CS1112 | 0 |
| 2 | CS1112 | 2011-07-31 | 72 | 1 | 200 | 66.666667 | 72 | 3 | 1.0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2011-06-15 00:00:00 | 46 | CS1112 | 0 |
| 3 | CS1112 | 2011-08-31 | 96 | 1 | 224 | 74.666667 | 96 | 3 | 1.0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2011-08-19 00:00:00 | 12 | CS1112 | 0 |
| 4 | CS1112 | 2011-09-30 | 72 | 1 | 240 | 80.000000 | 96 | 3 | 1.0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2011-08-19 00:00:00 | 42 | CS1112 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 323543 | CS9000 | 2014-11-30 | 72 | 1 | 216 | 72.000000 | 72 | 3 | 1.0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2014-08-24 00:00:00 | 98 | CS9000 | 0 |
| 323544 | CS9000 | 2014-12-31 | 72 | 1 | 216 | 72.000000 | 72 | 3 | 1.0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2014-08-24 00:00:00 | 129 | CS9000 | 0 |
| 323545 | CS9000 | 2015-01-31 | 72 | 1 | 216 | 72.000000 | 72 | 3 | 1.0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2014-08-24 00:00:00 | 160 | CS9000 | 0 |
| 323546 | CS9000 | 2015-02-28 | 34 | 1 | 178 | 59.333333 | 72 | 3 | 1.0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 2015-02-28 00:00:00 | 0 | CS9000 | 0 |
| 323547 | CS9000 | 2015-03-31 | 72 | 1 | 178 | 59.333333 | 72 | 3 | 1.0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2015-02-28 00:00:00 | 31 | CS9000 | 0 |
323548 rows × 33 columns
In Feb-2014, clients CS1350 and CS1200 emailed my customer service department complaining about the company's decision to market to them (or the lack of it). Hence, I will collect all the data for these customers up until January 2014.
#Filtering out data until January 2014
df_new = df.copy() #work on a copy so the original dataframe is untouched
df_new = df_new[df_new['ME_DT'] < '2014-01-31'] #strict '<' keeps month-ends through 2013-12-31
#Filtering out data for interested clients up until Jan 2014
clients = ["CS1350","CS1200"]
df_new = df_new[df_new["CLNT_NO"].isin(clients)]
df_new.reset_index(drop=True, inplace=True)
df_new
| | CLNT_NO | ME_DT | mth_txn_amt_sum | mth_txn_cnt | amt_sum_3M | amt_mean_3M | amt_max_3M | txn_cnt_sum_3M | txn_cnt_mean_3M | txn_cnt_max_3M | ... | cnt_Monday | cnt_Saturday | cnt_Sunday | cnt_Thursday | cnt_Tuesday | cnt_Wednesday | last_monthly_purchase | days_since_last_txn | customer_id | response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CS1200 | 2011-05-31 | 72 | 1 | 216 | 72.000000 | 72 | 3 | 1.000000 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | CS1200 | 0 |
| 1 | CS1200 | 2011-06-30 | 94 | 1 | 238 | 79.333333 | 94 | 3 | 1.000000 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2011-06-03 00:00:00 | 27 | CS1200 | 0 |
| 2 | CS1200 | 2011-07-31 | 72 | 1 | 238 | 79.333333 | 94 | 3 | 1.000000 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2011-06-03 00:00:00 | 58 | CS1200 | 0 |
| 3 | CS1200 | 2011-08-31 | 72 | 1 | 238 | 79.333333 | 94 | 3 | 1.000000 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2011-06-03 00:00:00 | 89 | CS1200 | 0 |
| 4 | CS1200 | 2011-09-30 | 170 | 2 | 314 | 104.666667 | 170 | 4 | 1.333333 | 2 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 2011-09-10 00:00:00 | 20 | CS1200 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 59 | CS1350 | 2013-08-31 | 84 | 1 | 231 | 77.000000 | 84 | 3 | 1.000000 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 2013-08-04 00:00:00 | 27 | CS1350 | 1 |
| 60 | CS1350 | 2013-09-30 | 85 | 1 | 244 | 81.333333 | 85 | 3 | 1.000000 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 2013-09-29 00:00:00 | 1 | CS1350 | 1 |
| 61 | CS1350 | 2013-10-31 | 120 | 2 | 289 | 96.333333 | 120 | 4 | 1.333333 | 2 | ... | 2 | 0 | 0 | 0 | 0 | 0 | 2013-10-14 00:00:00 | 17 | CS1350 | 1 |
| 62 | CS1350 | 2013-11-30 | 72 | 1 | 277 | 92.333333 | 120 | 4 | 1.333333 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2013-10-14 00:00:00 | 47 | CS1350 | 1 |
| 63 | CS1350 | 2013-12-31 | 72 | 1 | 264 | 88.000000 | 120 | 4 | 1.333333 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2013-10-14 00:00:00 | 78 | CS1350 | 1 |
64 rows × 33 columns
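A side note on the filter above: ME_DT is stored as strings, so the `<` comparison relies on ISO date ordering happening to match chronological order. A safer sketch converts to datetimes first (shown here on a hypothetical mini-frame, not the actual monthly table):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the monthly table.
df = pd.DataFrame({
    "CLNT_NO": ["CS1200", "CS1200", "CS1350"],
    "ME_DT":   ["2013-12-31", "2014-01-31", "2013-11-30"],
})

# String comparison works for ISO dates, but converting makes the intent
# explicit and surfaces malformed values immediately.
df["ME_DT"] = pd.to_datetime(df["ME_DT"])
cutoff = pd.Timestamp("2014-01-31")
kept = df[df["ME_DT"] < cutoff]  # strict '<' drops the Jan-2014 month-end
print(kept)
```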
Creating train and test set
#Train set for random forest model
X_train = df_new.drop(['CLNT_NO','ME_DT','response','customer_id','last_monthly_purchase'],axis=1)
y_train = df_new["response"]
#Test set for CS1200 and CS1350 separately
X_test_CS1200 = df_new[df_new["CLNT_NO"].isin(["CS1200"])]
X_test_CS1350 = df_new[df_new["CLNT_NO"].isin(["CS1350"])]
response_CS1200 = X_test_CS1200["response"]
response_CS1350 = X_test_CS1350["response"]
X_test_CS1200 = X_test_CS1200.drop(['CLNT_NO','ME_DT','response','customer_id','last_monthly_purchase'],axis=1)
X_test_CS1350 = X_test_CS1350.drop(['CLNT_NO','ME_DT','response','customer_id','last_monthly_purchase'],axis=1)
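Before training, it is worth checking the label balance of this small training set. A sketch with a stand-in series matching the response outputs shown later (32 zeros for CS1200, 32 ones for CS1350) confirms that this two-client subset is exactly balanced, even though the full monthly table is not:

```python
import pandas as pd

# Stand-in for df_new["response"]: CS1200's 32 months are all 0,
# CS1350's 32 months are all 1 (see the response_* outputs below).
y_train = pd.Series([0] * 32 + [1] * 32, name="response")
counts = y_train.value_counts()
print(counts)  # 32 of each class
```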
Training the model
The best random forest model that I used before had the following parameters:
#Random Forest Classifier
estimator_rf = RandomForestClassifier(random_state=42, max_depth=5, ccp_alpha=0.001, class_weight="balanced")
#Retraining model
estimator_rf.fit(X_train, y_train)
RandomForestClassifier(ccp_alpha=0.001, class_weight='balanced', max_depth=5, random_state=42)
#defining an explainer object
explainer = lime_tabular.LimeTabularExplainer(
training_data=np.array(X_train),
feature_names=X_train.columns,
class_names=['bad', 'good'],
mode='classification'
)
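The core idea behind the explainer defined above — perturb the instance, weight the perturbed samples by proximity, and fit a local linear surrogate — can be sketched with scikit-learn alone. This is a toy illustration of the mechanism, not lime's actual implementation:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)

# Toy black box: positive-class probability driven by feature 0 only.
def predict_proba_pos(X):
    return 1.0 / (1.0 + np.exp(-X[:, 0]))

x0 = np.array([0.5, -1.0])                 # instance to explain
X_pert = x0 + rng.normal(size=(1000, 2))   # perturbed neighborhood around x0
y_pert = predict_proba_pos(X_pert)         # black-box outputs on the neighborhood

# Proximity weights: nearby samples count more (Gaussian kernel).
dist = np.linalg.norm(X_pert - x0, axis=1)
weights = np.exp(-dist**2 / 0.75)

# Local surrogate: a weighted linear model approximating the black box near x0.
surrogate = Ridge(alpha=1.0).fit(X_pert, y_pert, sample_weight=weights)
print(surrogate.coef_)  # feature 0 should carry almost all of the weight
```

The surrogate's coefficients play the role of the bar lengths in the LIME plots below: they say how each feature moves the prediction in this local neighborhood only.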
#original responses for CS1200
response_CS1200
0     0
1     0
2     0
...
30    0
31    0
Name: response, dtype: int64
#explaining the prediction for CS1200
idx = np.random.randint(0, X_test_CS1200.shape[0]) #random month for this client (unseeded, so it varies per run)
exp = explainer.explain_instance(
    data_row=X_test_CS1200.iloc[idx],
    predict_fn=estimator_rf.predict_proba,
    num_features=28,
    top_labels=None,
    distance_metric='euclidean',
    num_samples=1000
)
exp.show_in_notebook(show_table=True)
Interpretation:
We see that this customer's recorded response was negative, meaning he did not want any promotional phone calls. The LIME explanation agrees: the black-box model predicts a negative response with 97% probability.
Yet the customer still received the call. The LIME plot attributes the negative prediction mainly to the features "amt_max_12M", "txn_cnt_sum_12M", "txn_cnt_max_12M" and "mth_txn_cnt".
The mismatch may also stem from other causes: the model may be biased (plausible given the imbalanced nature of the full dataset), the response labels may be wrong, or the model may have been trained on the wrong data. Another plausible explanation is human error in making the call or in compiling the phone-call list.
#original responses for CS1350
response_CS1350
32    1
33    1
34    1
...
62    1
63    1
Name: response, dtype: int64
#explaining the prediction for CS1350
idx = np.random.randint(0, X_test_CS1350.shape[0]) #random month for this client (unseeded, so it varies per run)
exp = explainer.explain_instance(
    data_row=X_test_CS1350.iloc[idx],
    predict_fn=estimator_rf.predict_proba,
    num_features=28,
    top_labels=None,
    distance_metric='euclidean',
    num_samples=1000
)
exp.show_in_notebook(show_table=True)
Interpretation:
We see that this customer's recorded response was positive, implying that he expected, or consented to, promotional calls. The LIME explanation agrees: the black-box model predicts a positive response with 88% probability.
Yet the customer did not receive a call. Similar reasoning applies here: the features "amt_max_12M", "amt_mean_3M", "txn_cnt_max_3M", "txn_cnt_sum_6M" and "txn_cnt_max_6M" influenced the model prediction negatively.
The model could also be biased, the responses mislabelled, or there may have been a human error.
In general, LIME results are weighted by the proximity of the sampled instances to the instance of interest, so they are not always accurate or sufficient on their own for an investigation like this. A more detailed investigation is needed.
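One concrete way to see this instability is to rerun a LIME-style local surrogate with different sampling seeds and measure how much the coefficients move. A minimal sketch using scikit-learn only (a toy black box, not the notebook's random forest):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy black box with a small cross-feature term.
def predict_proba_pos(X):
    return 1.0 / (1.0 + np.exp(-(X[:, 0] + 0.2 * X[:, 1])))

x0 = np.array([0.5, -1.0])  # instance to explain

def local_coefs(seed, n_samples=200):
    # One LIME-style run: perturb, weight by proximity, fit a linear surrogate.
    rng = np.random.default_rng(seed)
    X_pert = x0 + rng.normal(size=(n_samples, 2))
    y_pert = predict_proba_pos(X_pert)
    weights = np.exp(-np.linalg.norm(X_pert - x0, axis=1)**2 / 0.75)
    return Ridge(alpha=1.0).fit(X_pert, y_pert, sample_weight=weights).coef_

# Nonzero spread across seeds means single-run explanations should be
# treated with caution, especially with small num_samples.
coefs = np.array([local_coefs(seed) for seed in range(10)])
print("coefficient spread across seeds:", coefs.std(axis=0))
```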
!pip install shap
import shap
We have already trained our random forest model on the data for these two clients. We'll create SHAP force plots for this model to explain its predictions.
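Before reading the force plots, it helps to see exactly what SHAP approximates: each feature's Shapley value, i.e. its average marginal contribution over all orderings of the features. For a tiny additive toy model this can be computed exactly by enumeration (feature names here are illustrative, not taken from the monthly table):

```python
from itertools import permutations

# Hypothetical per-feature contributions for one instance.
x = {"amt_max_12M": 3.0, "amt_mean_3M": -1.0, "mth_txn_cnt": 0.5}

def coalition_value(features):
    # Toy additive "model": a coalition's value is the sum of its members.
    return sum(x[f] for f in features)

names = list(x)
perms = list(permutations(names))
phi = {f: 0.0 for f in names}
for order in perms:
    included = []
    for f in order:
        before = coalition_value(included)
        included.append(f)
        # Marginal contribution of f in this ordering, averaged over orderings.
        phi[f] += (coalition_value(included) - before) / len(perms)

print(phi)  # for an additive model, each Shapley value equals the feature's own term
```

KernelExplainer estimates these values by sampling coalitions (hence it is slow), while TreeExplainer computes them exactly from the tree structure.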
#force plot using kernel explainer for CS1200
explainer = shap.KernelExplainer(estimator_rf.predict_proba, X_train)
shap.initjs()
shap_values = explainer.shap_values(X_test_CS1200.iloc[0,:])
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test_CS1200.iloc[0,:])
#force plot using tree explainer for CS1200
#plot for the first instance of CS1200
tree_explainer = shap.TreeExplainer(estimator_rf)
shap.initjs()
shap_values = tree_explainer.shap_values(X_test_CS1200) #SHAP values for the instances being plotted
shap.force_plot(tree_explainer.expected_value[0], shap_values[0][0], X_test_CS1200.iloc[0,:])
Interpretation
The SHAP plot using the kernel explainer helps explain why client CS1200 received the call. Even though the client responded negatively to promotional calls, the feature "amt_max_12M" had the largest impact pushing the model prediction towards a positive response, and the size of its bar shows the magnitude of that impact. Values of this feature far above the base value are driving the prediction, so its influence on the model may need to be dampened.
The second plot, from the tree explainer, shows the same instance; its leading feature, "amt_max_12M = 99", dominates the contribution to the model prediction.
- SHAP Plot for CS1350
#force plot using kernel explainer for CS1350
explainer = shap.KernelExplainer(estimator_rf.predict_proba, X_train)
shap.initjs()
shap_values = explainer.shap_values(X_test_CS1350.iloc[0,:])
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test_CS1350.iloc[0,:])
#force plot using tree explainer for CS1350
tree_explainer_2 = shap.TreeExplainer(estimator_rf)
shap.initjs()
shap_values_2 = tree_explainer_2.shap_values(X_test_CS1350) #SHAP values for the instances being plotted
shap.force_plot(tree_explainer_2.expected_value[0], shap_values_2[0][0], X_test_CS1350.iloc[0,:])
Interpretation
Similarly, for client CS1350 the kernel-explainer SHAP plot helps explain why he or she did not receive the call. Even though the client responded positively to promotional calls, the feature "amt_max_12M = 120", despite sitting above the expected (base) value, still impacts the model prediction negatively. In addition, the feature "amt_mean_3M" is much lower than the base value and significantly drags the prediction down. The influence of these two features needs to be re-examined to rectify the issue.
The second plot, from the tree explainer, shows the same instance; its leading feature, "amt_max_12M = 120", dominates the contribution to the model prediction.
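A handy sanity check when reading any force plot: SHAP values are constructed so that they sum to the model output minus the base value (the explainer's expected_value). For a linear model this additivity can be verified exactly without the shap library (a sketch with a toy linear model, not the notebook's random forest):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
beta = np.array([2.0, -1.0, 0.5])

def f(X):
    # Toy linear model standing in for a black box.
    return X @ beta

x = X[10]                  # instance being "explained"
base_value = f(X).mean()   # analogue of explainer.expected_value

# For a linear model, the exact SHAP value of feature j at x is
# beta[j] * (x[j] - E[X[:, j]]).
phi = beta * (x - X.mean(axis=0))

# Additivity: contributions must bridge the base value to the prediction,
# exactly as the arrows in a force plot do.
prediction = f(x[None, :])[0]
print("base:", base_value, "prediction:", prediction, "sum of phi:", phi.sum())
```

If the bars in a force plot do not visibly bridge the base value to the displayed prediction, something is wrong with how the SHAP values were matched to the instance.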